Abstract

A neural network works as an associative memory device if it has a large storage capacity and the quality of the retrieval is good enough. The learning and attractor abilities of the network can both be measured by the mutual information (MI) between patterns and retrieval states. This paper deals with the search for an optimal topology of a Hebb network, in the sense of maximal MI. We use a small-world topology. The connectivity $\gamma$ ranges from an extremely diluted to the fully connected network; the randomness $\omega$ ranges from purely local to completely random neighbors. It is found that, while stability implies an optimal $MI(\gamma, \omega)$ at $\gamma_{opt}(\omega) \to 0$, for the dynamics the optimal topology holds at a certain $\gamma_{opt} > 0$ whenever $0 < \omega < 0.3$.


1 Introduction 


The collective properties of attractor neural networks (ANN), such as the ability to perform as an associative memory, have been a subject of intensive research in the last couple of decades [1], dealing mainly with fully-connected topologies. More recently, the interest in ANN has been renewed by the study of more realistic architectures, such as small-world [2], [4] or scale-free [3], [15] models. The storage capacity $\alpha_c$ and the overlap $m$ with the memorized patterns are the most used measures of the retrieval ability for Hopfield-Hebb networks [5], [6]. Comparatively less attention has been paid to the study of the mutual information (MI) between stored patterns and the neural states [7], [8], although neural networks are information processing machines.

The reason for this relatively low interest is twofold. On the one hand, it is easier to deal with the global parameter $m[\vec\sigma, \vec\xi]$ than with $MI[p(\sigma|\xi)]$, a function of the conditional probability of the neuron states $\sigma$ given the patterns $\xi$. This can be solved for the so-called mean-field networks, which satisfy the law of large numbers; hence MI is a function only of the macroscopic parameter $m$ and the load rate $\alpha = P/K$ (where $P$ is the number of uncorrelated patterns and $K$ is the neuron connectivity). On the other hand, the load $\alpha$ is enough to measure the information if the overlap is close to $m \sim 1$, since in this case the information carried by any single binary neuron is almost 1 bit. This is true for a fully-connected (FC) network, for which the critical load is $\alpha_c^{FC} \simeq 0.138$ [5], with $m_c^{FC} \simeq 0.97$ (and a sharp transition to $m \to 0$ for larger $\alpha > \alpha_c$): in this case, the information rate is about $i_c^{FC} \simeq 0.131$, as can be seen in the left panel of Fig. 1. There we show the overlap (upper) and information for several architectures. However, in the case of diluted networks the transition is smooth. In particular, the random extremely diluted (RED) network has load capacity $\alpha_c^{RED} \simeq 0.64$ [10], but the overlap falls continuously to $m_c^{RED} \simeq 0$, which yields null information at the transition, $i_c^{RED} \simeq 0.0$, as seen in the right panel of Fig. 1 (dashed line). This indeterminacy shows that one must search for the value $\alpha_{max}$ corresponding to the maximal information $MI_{max} \equiv MI(\alpha_{max})$, instead of $\alpha_c$.

Figure 1: The overlap $m$ and the information $i$ vs $\alpha$ for different architectures: fully-connected, $\gamma_{FC} = 1.0$ (left), moderately-diluted, $\gamma_{MD} = 10^{-2}$ (center) and extremely-diluted, $\gamma_{ED} = 10^{-4}$ (right). Symbols represent simulations with initial overlap $m^0 = 1$ and $|J| = 40$M, with local (stars, $\omega = 0.0$), small-world (filled squares, $\omega = 0.2$), and random (circles, $\omega = 1.0$) connections. Lines are theoretical results: solid, $\omega = 0.0$; dotted, $\omega = 0.2$; dashed, $\omega = 1.0$. In the left panel, the dashed line is an average of the simulation.



We address the problem of searching for the optimal topology, in the sense of maximizing the mutual information. Using the graph framework [3], one can capture the main properties of a wide range of neural systems with only two parameters: $\gamma \equiv K/N$, the average rate of links per neuron, where $N$ is the network size, and $\omega$, which controls the rate of random links (among all neighbors). When $\gamma$ is large, the clustering coefficient is large ($c \sim 1$) and the mean path length between neurons is small ($l \sim \ln N$), whatever $\omega$ is. When $\gamma$ is small and $\omega$ is too small, $c \sim 1$ and $l \sim N/K$; but if $\omega \sim 0.1$, the network behaves again as if $\gamma \sim 1$, with $c \sim 1$ and $l \sim \ln N$. This region, called small-world (SW), is rather useful when one is interested in building networks where the information transmission is fast and efficient, with high capacity in the presence of significant noise, but does not want to spend too much wiring [17]. Small-world networks may model many biological systems [14]. For instance, in the brain local connections dominate within the cortex, while there are a few intercortical connections [13].

In Fig. 1 we show the overlap (upper) and information for several architectures. In the left panel, it is seen that the maximum information rate, $i \equiv MI/(K \cdot N)$, of the FC network is about $i_{max}^{FC} \simeq 0.135$, while in the right panel we show extremely-diluted (ED) networks. The RED network ($\omega = 1.0$) has $i_{max}^{RED} \simeq 0.223$. The right panel of Fig. 1 also plots the overlap and the information for the local extremely diluted network (LED, $\omega = 0.0$), with $i_{max}^{LED} \simeq 0.0855$, and for a small-world extremely diluted network (SED, $\omega = 0.2$), with $i_{max}^{SED} \simeq 0.165$. We see that the ED transitions are smooth. The central panel of Fig. 1 plots moderately diluted (MD) networks, which are discussed later. Theoretical results fit well with the simulations, except for small $\omega$, where theory underestimates the simulation results. Previous works on small-world attractor neural networks [12] studied only the overlap $m(\alpha)$, so no results about the information were known.

Our main goal in this work is to answer the following question: how does the maximal information, $i_{max}(\gamma, \omega) \equiv i(\alpha_{max}; \gamma, \omega)$, behave with respect to the network topology? To our knowledge, there has been no answer to this question up to now. We will show that, near the stationary retrieval states, for every value of the randomness $\omega > 0$, the extremely-diluted network performs the best, $\gamma_{opt} \to 0$. However, regarding the attractor basins, starting far from the patterns, the optimal topology holds for moderate $\gamma_{opt}$. For instance, if transients are taken into account, values of $\omega \sim 0.1$ lead to an optimal $i_{opt}(\omega) \equiv i_{max}(\gamma_{opt}, \omega)$ with $\gamma_{opt} \sim 10^{-2}$.

The structure of the paper is the following: in the next section we review the information measures used in the calculations; in Sec. 3 we define the topology and the neuro-dynamics model. The results are shown in Sec. 4, where we study retrieval by theory and simulation (with random patterns and with images); conclusions are drawn in the last section.

2 The Information Measures 

2.1 The Neural Channel 

The network state at a given time $t$ is defined by a set of binary neurons, $\vec\sigma^t = \{\sigma_i^t \in \{\pm 1\}, i = 1, \dots, N\}$. Accordingly, each pattern $\vec\xi^\mu = \{\xi_i^\mu \in \{\pm 1\}, i = 1, \dots, N\}$ is a set of site-independent random variables, binary and uniformly distributed: $p(\xi_i^\mu = \pm 1) = 1/2$. The network learns a set of independent patterns $\{\vec\xi^\mu, \mu = 1, \dots, P\}$.

The task of the neural channel is to retrieve a pattern (say, $\vec\xi^\mu$) starting from a neuron state which is inside its attractor basin, $B(\vec\xi^\mu)$, i.e.: $\vec\sigma^0 \in B(\vec\xi^\mu) \to \vec\sigma^\infty \approx \vec\xi^\mu$. This is achieved through a network dynamics which couples neighbor neurons $\sigma_i, \sigma_j$ by the synaptic matrix $J \equiv \{J_{ij}\}$ with cardinality $|J| = N \times K$.

2.2 The Overlap 

For the usual binary non-biased neuron model, the relevant order parameter is the overlap between the neural states and a given pattern:

$$m_N^{\mu t} \equiv \frac{1}{N} \sum_i \xi_i^\mu \sigma_i^t, \tag{1}$$

at the time step $t$. Note that both positive, $\vec\xi$, and negative, $-\vec\xi$, patterns carry the same information, so the absolute value of the overlap measures the retrieval quality: $|m| \sim 1$ means a good retrieval. Alternatively, one can measure the retrieval error using the Hamming distance: $D_N^{\mu t} \equiv \frac{1}{N}\sum_i |\xi_i^\mu - \sigma_i^t|^2 = 2(1 - m_N^{\mu t})$.
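The relation between the overlap and the Hamming distance can be checked numerically. The following is a minimal sketch, assuming the distance is normalized by $N$ so that it matches $2(1 - m)$; the function names and the 10% flip rate are illustrative choices, not taken from the paper.

```python
import random

def overlap(pattern, state):
    """Overlap m = (1/N) * sum_i xi_i * sigma_i, as in Eq. (1)."""
    return sum(x * s for x, s in zip(pattern, state)) / len(pattern)

def hamming_distance(pattern, state):
    """Normalized squared distance D = (1/N) * sum_i |xi_i - sigma_i|^2."""
    return sum((x - s) ** 2 for x, s in zip(pattern, state)) / len(pattern)

rng = random.Random(0)
N = 1000
# Unbiased binary pattern, xi_i = +1 or -1 with probability 1/2.
xi = [rng.choice((-1, 1)) for _ in range(N)]
# Noisy state: each neuron is flipped with probability 0.1 (illustrative value).
sigma = [-x if rng.random() < 0.1 else x for x in xi]

m = overlap(xi, sigma)
D = hamming_distance(xi, sigma)
print(m, D, 2 * (1 - m))  # D and 2(1 - m) coincide
```

Each flipped neuron contributes 4 to the squared distance and lowers the overlap sum by 2, which is why the two measures carry the same information.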

Together with the overlap, one needs a measure of the load, which is the rate of pattern bits per synapse used to store them. Since the synapses and patterns are independent, the load is given by $\alpha = |\{\vec\xi^\mu\}|/|J| = (PN)/(NK) = P/K$.

We require our network to have long-range interactions. Therefore, we regard it as a mean-field network (MFN): the distribution of the states is site-independent, so every spatial correlation such as $\langle\sigma_i\sigma_j\rangle - \langle\sigma_i\rangle\langle\sigma_j\rangle$ can be neglected, which is reasonable in the asymptotic limit $K, N \to \infty$. Hence the conditions of the law of large numbers are fulfilled. At a given time step of the dynamical process, the network state can be described by one particular overlap, say $m_N^t \equiv m_N^{\mu t}$. The order parameter can thus be written, when $N \to \infty$, as $m^t = \langle\sigma^t \xi\rangle_{\sigma,\xi}$. The brackets represent an average over the joint distribution $p(\sigma, \xi)$ for a single neuron (we can drop the index $i$). This macroscopic variable describes the information processing of the network at a given time step $t$ of the dynamics. Along with this signal parameter, the residual $P - 1$ microscopic overlaps yield the cross-talk noise, whose statistics complete the network macro-dynamics.



2.3 Mutual Information 

For a long-range system, it is enough to observe the distribution of a single neuron in order to know the global distribution [8]. This is given by the conditional probability of having the neuron in a state $\sigma$, at each (unspecified) time step $t$, given that in the same site the pattern being retrieved is $\xi$. For the binary network we are considering, $p(\sigma|\xi) = (1 + m\sigma\xi)\,\delta(\sigma^2 - 1)$ [9], where $m$ is the overlap defined in Sec. 2.2.

The joint distribution $p(\sigma, \xi)$ is interpreted as an ensemble distribution for the neuron states $\{\sigma_i\}$ and inputs $\{\xi_i\}$. The conditional probability $p(\sigma|\xi)$ encloses all types of noise in the retrieval process of the input pattern through the network (both from the environment and from the dynamical process itself).

With the above expressions and $p(\sigma) \equiv \sum_\xi p(\xi)\, p(\sigma|\xi) = \delta(\sigma^2 - 1)$, we can calculate the MI [8], a quantity used to measure the prediction that an observer at the output ($\sigma^t$) can make about the input ($\xi^\mu$) (we drop the time index $t$). It reads $MI[\sigma; \xi] = S[\sigma] - S[\sigma|\xi]$, where $S[\sigma]$ is the entropy and $S[\sigma|\xi]$ is the conditional entropy. We use binary logarithms to measure the information in bits. The entropies are [9]:

$$S[\sigma|\xi] = -\frac{1+m}{2}\log_2\frac{1+m}{2} - \frac{1-m}{2}\log_2\frac{1-m}{2}, \qquad S[\sigma] = 1\ \text{[bit]}. \tag{2}$$

We define the information rate as

$$i(\alpha, m) = MI[\vec\sigma|\{\vec\xi^\mu\}]/|J| = \alpha\, MI[\sigma; \xi], \tag{3}$$

since for independent neurons and patterns, $MI[\vec\sigma|\{\vec\xi^\mu\}] \equiv \sum_{i\mu} MI[\sigma_i|\xi_i^\mu]$. When the network approaches its saturation limit $\alpha_c$, the states cannot remain close to the patterns, so $m_c$ is usually small. Thus, while the number of patterns increases, the information per pattern decreases. Therefore, the information $i(\alpha, m)$ is a non-monotonic function of the overlap and load rate (see Fig. 1), which reaches its maximum value $i_{max} \equiv i(\alpha_{max})$ at some value of the load $\alpha_{max}$.
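The entropies of Eq. (2) and the information rate of Eq. (3) translate directly into a few lines of code. This is a minimal sketch with hypothetical function names; it only evaluates the formulas for a given pair $(\alpha, m)$ and does not model how $m$ depends on $\alpha$.

```python
import math

def conditional_entropy(m):
    """S[sigma|xi] in bits for overlap m, Eq. (2)."""
    h = 0.0
    for p in ((1 + m) / 2, (1 - m) / 2):
        if p > 0:  # the term p*log2(p) vanishes as p -> 0
            h -= p * math.log2(p)
    return h

def mutual_information(m):
    """MI[sigma; xi] = S[sigma] - S[sigma|xi], with S[sigma] = 1 bit."""
    return 1.0 - conditional_entropy(m)

def information_rate(alpha, m):
    """i(alpha, m) = alpha * MI, Eq. (3)."""
    return alpha * mutual_information(m)

for m in (0.0, 0.5, 0.9, 1.0):
    print(m, mutual_information(m), information_rate(0.2, m))
```

For perfect retrieval ($m = 1$) each neuron carries one bit and the rate equals the load, while for $m = 0$ the rate vanishes whatever the load is; the maximum of $i$ therefore sits at an intermediate load where $m$ has started to decay.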



3 The Model 

3.1 The Network Topology 

The synaptic couplings are $J_{ij} \equiv C_{ij} W_{ij}$, where the connectivity matrix has a local and a random part, $C_{ij} = C_{ij}^n + C_{ij}^r$, and $W$ are the synaptic weights. The local part connects the $K_n$ nearest neighbors, $C_{ij}^n = \sum_{k \in V} \delta(i - j - k)$, with $V = \{1, \dots, K_n\}$ in the asymmetric case, on a closed ring. The random part consists of independent random variables $\{C_{ij}^r\}$, distributed with probability $p(C_{ij}^r = 1) = c_r$, and $C_{ij}^r = 0$ otherwise, with $c_r = K_r/N$, where $K_r$ is the mean number of random connections of a single neuron. Hence, the neuron connectivity is $K = K_n + K_r$. The network topology is then characterized by two parameters: the connectivity ratio, defined as $\gamma = K/N$, and the randomness ratio, $\omega = K_r/K$. The parameter $\omega$ plays the role of the rewiring probability in the small-world (SW) model [2]. Our model was proposed by Newman and Watts [19], and has the advantage of avoiding disconnecting the graph.
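A topology of this kind can be sketched as an adjacency list. The version below is an illustration under stated assumptions, not the authors' code: every neuron gets exactly $K_r = \omega K$ random neighbors (instead of a random number with mean $K_r$), and random neighbors that coincide with the neuron itself or with an existing neighbor are redrawn so that the list has no duplicates. The function name is hypothetical.

```python
import random

def small_world_neighbors(N, K, omega, seed=0):
    """Adjacency list on a closed ring: K_n local plus K_r random neighbors.

    Assumes K < N so that enough distinct neighbors exist.
    """
    rng = random.Random(seed)
    K_r = round(omega * K)   # number of random links per neuron
    K_n = K - K_r            # number of local (nearest-neighbor) links
    neighbors = []
    for i in range(N):
        # Asymmetric local part: j = i - k (mod N), k = 1..K_n.
        local = [(i - k) % N for k in range(1, K_n + 1)]
        excluded = set(local) | {i}
        rand = []
        while len(rand) < K_r:
            j = rng.randrange(N)
            if j not in excluded:   # no self-links, no repeated neighbors
                excluded.add(j)
                rand.append(j)
        neighbors.append(local + rand)
    return neighbors

nb = small_world_neighbors(N=200, K=20, omega=0.2)
print(nb[0])  # 16 ring neighbors followed by 4 random ones
```

Storing only the $K$ neighbor indices per neuron keeps the memory cost at $N \cdot K$ entries, which is what makes strongly diluted networks with large $N$ affordable.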

Figure 2: Maximal information $i_{max} \equiv i(\alpha_{max})$ vs $\gamma$. Theoretical results for the stationary states, with several values of the randomness $\omega$.

Figure 3: $i_{max}$ vs $\gamma$. Simulation with $\omega = 0.1$ and $m^0 = 0.3$. Dynamics stop at $t = 10$ (plus symbols, solid line) or at $t = 100$ (circles, dashed line).

Note that the topology $C$ can be defined by an adjacency list connecting neighbors, $i_k$, $k = 1, \dots, K$, with $C_{ij} = 1 : j = i_k$. So the storage cost of this network is $|J| = N \cdot K$. Hence, the information is $i = \alpha MI$, Eq. (3), where the load rate is scaled as $\alpha = P/K$. The learning algorithm updates $W$ according to the Hebb rule

$$W_{ij}^\mu = W_{ij}^{\mu-1} + \frac{1}{K}\xi_i^\mu \xi_j^\mu. \tag{4}$$

The network starts at $W_{ij}^0 = 0$, and after $\mu = P = \alpha K$ learning steps it reaches the value $W_{ij} = \frac{1}{K}\sum_\mu^P \xi_i^\mu \xi_j^\mu$. The learning stage is a slow dynamics, being stationary-like on the time scale of the much faster retrieval stage, which we define in the following.
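The incremental Hebb rule can be sketched on top of an adjacency list, storing one weight per existing link. This is a minimal illustration: the $1/K$ normalization is an assumption here (it does not change the sign of the local field at $T = 0$), the function name is hypothetical, and the small ring used for the demonstration is purely local.

```python
import random

def hebb_learn(neighbors, patterns):
    """Return W[i][k], the weight from neuron i to its k-th neighbor."""
    W = [[0.0] * len(nb) for nb in neighbors]   # the network starts at W = 0
    for xi in patterns:                          # one learning step per pattern
        for i, nb in enumerate(neighbors):
            K = len(nb)
            for k, j in enumerate(nb):
                W[i][k] += xi[i] * xi[j] / K     # assumed 1/K normalization
    return W

rng = random.Random(0)
N, K, P = 50, 4, 3
# Local ring with K neighbors per neuron (illustrative topology).
neighbors = [[(i - k) % N for k in range(1, K + 1)] for i in range(N)]
patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(P)]
W = hebb_learn(neighbors, patterns)
print(W[0])  # K weights, each a multiple of 1/K
```

Because the weights are indexed by position in the neighbor list, the storage is $N \cdot K$ numbers, matching the cost $|J|$ used to scale the load.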

3.2 The Neural Dynamics 

The neural states, $\sigma_i^t \in \{\pm 1\}$, are updated according to the stochastic parallel dynamics:

$$\sigma_i^{t+1} = \mathrm{sign}(h_i^t + T x), \qquad h_i^t \equiv \sum_j J_{ij}\sigma_j^t, \qquad i = 1, \dots, N, \tag{5}$$

where $x$ is a normalized random variable and $T$ is the temperature-like environmental noise. In the case of symmetric synaptic couplings, $J_{ij} = J_{ji}$, an energy function $H_s = -\sum_{(i,j)} J_{ij}\sigma_i\sigma_j$ can be defined, whose minima are the stable states of the dynamics Eq. (5).

In the present paper, we work out the asymmetric network by simulation (no constraint $J_{ij} = J_{ji}$). The theory was carried out for symmetric networks. As seen in Fig. 1, theory and simulation show similar results, except for local networks (where theory underestimates $\alpha_{max}$), for which the symmetry may play some role. We also restrict our analysis to the deterministic dynamics ($T = 0$). The stochastic macro-dynamics comes from the extensive number of learned patterns, $P = \alpha K$.
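One deterministic ($T = 0$) parallel update of Eq. (5) can be sketched as follows. The example is deliberately small and controlled: a single stored pattern on a local ring, with every tenth neuron flipped in the initial state, so that one update restores the pattern exactly. The tie-breaking convention $\mathrm{sign}(0) = +1$, the function names and all sizes are assumptions made for illustration.

```python
import random

def parallel_step(state, neighbors, W):
    """Update all neurons at once: sigma_i <- sign(sum_k W[i][k] * sigma_j)."""
    new_state = []
    for i, nb in enumerate(neighbors):
        h = sum(W[i][k] * state[j] for k, j in enumerate(nb))
        new_state.append(1 if h >= 0 else -1)   # assumed convention sign(0) = +1
    return new_state

def overlap(pattern, state):
    return sum(x * s for x, s in zip(pattern, state)) / len(pattern)

rng = random.Random(1)
N, K = 100, 20
xi = [rng.choice((-1, 1)) for _ in range(N)]
neighbors = [[(i - k) % N for k in range(1, K + 1)] for i in range(N)]
# Hebb weights for a single stored pattern (1/K normalization assumed).
W = [[xi[i] * xi[j] / K for j in nb] for i, nb in enumerate(neighbors)]

# Initial state: the pattern with every tenth neuron flipped (overlap 0.8).
sigma0 = [-x if i % 10 == 0 else x for i, x in enumerate(xi)]
sigma1 = parallel_step(sigma0, neighbors, W)
print(overlap(xi, sigma0), overlap(xi, sigma1))
```

With many stored patterns the local field also contains the cross-talk of the other $P - 1$ patterns, which is the source of the effective noise mentioned in the text.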

4 Results



The information for the stationary and dynamical states of the network was studied as a function of the topological parameters, $\omega$ and $\gamma$. A sample of the results for simulation and theory is shown in Fig. 1, where the stationary states of the overlap and information are plotted for the FC, MD and ED architectures. It can be seen that the information increases with dilution and with randomness of the network. A reason for this behavior is that dilution decreases the correlation due to the interference between patterns. However, dilution also increases the mean path length of the network; thus, if the connections are local, the information flows slowly over the network. Hence, the neuron states can eventually be trapped in noisy patterns. So, $i_{max}$ is small for $\omega \sim 0$ even if $\gamma = 10^{-4}$.

4.1 Theory: Stationary States 

Following the Gardner calculations [10], at temperature $T = 0$ the MFN approximation gives the fixed-point equations:

$$m = \mathrm{erf}(m/\sqrt{r\alpha}), \tag{6}$$

$$\chi = 2\varphi(m/\sqrt{r\alpha})/\sqrt{r\alpha}, \tag{7}$$

$$r = \sum_{k=0}^{\infty} a_k (k+1) \chi^k, \qquad a_k = \gamma\, \mathrm{Tr}[(C/K)^{k+2}], \tag{8}$$

with $\mathrm{erf}(x) \equiv 2\int_0^x \varphi(z)\,dz$ and $\varphi(z) \equiv e^{-z^2/2}/\sqrt{2\pi}$. The parameter $a_k$ is the probability of existence of a cycle of length $k + 2$ in the connectivity graph. The $a_k$ can be calculated either by Monte Carlo [16], or by an analytical approach, which gives $a_k \sim \sum_m \int d\theta\, [p(\theta)]^k e^{im\theta}$, where $p(\theta)$ is the Fourier transform of the probability of links, $p(C_{ij})$. For RED and FC networks one recovers the known results, $r^{RED} = 1$ and $r^{FC} = 1/(1 - \chi)^2$, respectively [1].
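For the two limiting topologies where $r$ is known in closed form, Eqs. (6)-(8) can be explored numerically. The sketch below uses plain fixed-point iteration starting from $m = 1$, which is an assumption of this illustration (the text does not say how the equations were solved) and may converge poorly near the transition. Note that the erf defined in the text, built on the standard normal density, equals `math.erf(x / sqrt(2))`. All function names are hypothetical.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def erf_text(x):
    """erf(x) = 2 * int_0^x phi(z) dz, i.e. math.erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))

def r_red(chi):
    """Random extremely diluted network: r = 1."""
    return 1.0

def r_fc(chi):
    """Fully connected network: r = 1/(1 - chi)^2 (series diverges for chi >= 1)."""
    return math.inf if chi >= 1.0 else 1.0 / (1.0 - chi) ** 2

def stationary_overlap(alpha, r_of_chi, iterations=5000):
    """Iterate Eqs. (6)-(8) from m = 1, r = 1 and return the final overlap."""
    m, r = 1.0, 1.0
    for _ in range(iterations):
        s = math.sqrt(r * alpha)
        x = m / s
        m = erf_text(x)            # Eq. (6)
        chi = 2.0 * phi(x) / s     # Eq. (7)
        r = r_of_chi(chi)          # Eq. (8) in closed form
    return m

for alpha in (0.1, 0.5, 0.7):
    print(alpha, stationary_overlap(alpha, r_red))
```

For the RED case the only solution is $m = 0$ once the slope of the right-hand side of Eq. (6) at the origin drops below one, that is for $\alpha > 2/\pi \simeq 0.64$, in agreement with the capacity quoted in the Introduction.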

The theoretical dependence of the information on the load, for FC, MD and ED networks with local, small-world and random connections, is plotted as thick lines in Fig. 1. A comparison between theory and simulation is also given in Fig. 1. It can be seen that both results agree for most $\omega > 0$, but theory fails for $\omega = 0$. One reason is that the theory uses the symmetric constraint, while the simulation was carried out with asymmetric synapses. Figure 2 shows the maxima $i(\alpha_{max})$ vs. the parameters $(\omega, \gamma)$. It is seen that the optimum is at $\omega \to 1$, $\gamma \to 0$. This implies that the best topology for information (stationary states) is the extremely diluted network with purely random connectivity.

4.2 Simulation: Attractors and Transients 

We have studied the behavior of the network varying the range of connectivity $\gamma$ and randomness $\omega$. We used Eq. (5). Both local and random connections are asymmetric. The simulation was carried out with $N \times K = 36 \cdot 10^6$ synapses, storing an adjacency list as the data structure, instead of $J_{ij}$. For instance, with $\gamma \equiv K/N = 0.01$, we used $K = 600$, $N = 6 \cdot 10^4$. In [12] the authors use $K = 50$, $N = 5 \cdot 10^3$, which is far from the asymptotic limit.
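The sizes quoted above follow from fixing the number of synapses and the connectivity ratio: with $N K = |J|$ and $K/N = \gamma$ one gets $K = \sqrt{\gamma |J|}$ and $N = \sqrt{|J|/\gamma}$. A short sketch of this arithmetic (the function name is hypothetical, and rounding to integers is an illustrative choice):

```python
import math

def network_size(synapses, gamma):
    """Return (N, K) with N*K = synapses and K/N = gamma, rounded to integers."""
    K = math.sqrt(synapses * gamma)
    N = math.sqrt(synapses / gamma)
    return round(N), round(K)

for gamma in (1.0, 1e-2, 1e-4):
    print(gamma, network_size(36e6, gamma))
```

Keeping $|J|$ fixed while changing $\gamma$ compares topologies at equal storage cost: diluted networks trade fewer links per neuron for many more neurons.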

We studied the network by searching for the stability properties and transients of the neuron dynamics. To look for stability, we start the network at some pattern (with initial overlap $m^0 = 1.0$) and wait to see whether it stays there or leaves it after a flag time step $t = t_f$ (unless it converges to a fixed point $m^*$ before $t = t_f$). When we check transients, we start with $m^0 = 0.1$ and stop the dynamics at the time $t_f$. Usually, $t_f = 20$ parallel (all neurons) updates is a long enough delay for retrieval. Indeed, in most cases far below saturation, the network ends up in a pattern after $t_f = 4$; however, near $\alpha_{max}$, even after $t_f = 100$ the network has not yet relaxed.



Figure 4: The information vs the load, $i(\alpha)$, with connectivities from $\gamma = 1.0$ (left) to $\gamma = 10^{-4}$ (right), with $N \cdot K = 4 \cdot 10^7$. In the upper panels, the simulation starts with $m^0 = 1.0$; in the lower panels, with $m^0 = 0.1$. Retrieval stops at $t_f = 20$. The randomness values are $\omega = 0.0$ (open circles), $\omega = 0.1$ (plus symbols) and $\omega = 0.2$ (triangles). The solid line for $\omega = 0.1$ with $m^0 = 0.1$ is a guide to the eye.



First, we checked the stability properties of the network: the neuron states start precisely at a given pattern $\vec\xi^\mu$ (which changes at each learning step $\mu$). The initial overlap is $m^0 = 1.0$, and after $t_f \le 20$ time steps of retrieval the information $i(\alpha, m; \gamma, \omega)$ for the final overlap is calculated. We plot it as a function of $\alpha$, and its maximum $i_{max} \equiv i(\alpha_{max}; \gamma, \omega)$ is evaluated. We averaged over a window along the $P$ axis, usually $\delta P = 25$. This is repeated for various values of the connectivity ratio $\gamma$ and randomness $\omega$. The results are in the upper panels of Fig. 4.
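Locating the maximum of a noisy curve $i(P)$ after window averaging can be sketched as below. The text does not specify how the window average is taken; this illustration assumes a plain moving average over consecutive values of $P$ and reports the center of the best window. The function names and the synthetic test curve are hypothetical.

```python
def moving_average(values, window):
    """Average over every run of `window` consecutive values."""
    return [sum(values[s:s + window]) / window
            for s in range(len(values) - window + 1)]

def smoothed_maximum(P_values, info_values, window=25):
    """Return (mean P of the best window, averaged information there)."""
    smooth = moving_average(info_values, window)
    best = max(range(len(smooth)), key=smooth.__getitem__)
    P_center = sum(P_values[best:best + window]) / window
    return P_center, smooth[best]

# Synthetic, noise-free curve peaked at P = 50 (illustrative data only).
P_values = list(range(101))
info_values = [-(P - 50) ** 2 for P in P_values]
print(smoothed_maximum(P_values, info_values, window=5))
```

Dividing the returned $P$ by $K$ gives the corresponding load $\alpha_{max} = P/K$.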

Second, we checked the retrieval properties: the neuron states start far from a learned pattern, but inside its basin of attraction, $\vec\sigma^0 \in B(\vec\xi^\mu)$. The initial configuration is chosen with distribution $p(\sigma_i^0 = \pm\xi_i^\mu | \xi_i^\mu) = (1 \pm m^0)/2$ for all neurons (so we avoid a bias between local and random neighbors). The initial overlap is now $m^0 = 0.1$, and after $t_f \le 20$ steps the information $i(\alpha, m; \gamma, \omega)$ is calculated. The results are in the lower panels of Fig. 4. The first observation is that the maximal information $i_{max}(\gamma, \omega)$ increases with dilution (smaller $\gamma$) if the network is more random, $\omega \sim 1$, while it decreases with dilution if the network is more local, $\omega \sim 0$.
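The initial condition with a prescribed overlap can be generated by keeping each neuron aligned with the pattern with probability $(1 + m^0)/2$. A minimal sketch, with a hypothetical function name and an illustrative network size:

```python
import random

def noisy_copy(pattern, m0, rng):
    """Draw sigma_i = +xi_i with probability (1 + m0)/2, else -xi_i."""
    p_same = (1 + m0) / 2
    return [x if rng.random() < p_same else -x for x in pattern]

rng = random.Random(0)
N = 100000
xi = [rng.choice((-1, 1)) for _ in range(N)]
sigma0 = noisy_copy(xi, 0.1, rng)
m_measured = sum(x * s for x, s in zip(xi, sigma0)) / N
print(m_measured)  # close to 0.1, with fluctuations of order 1/sqrt(N)
```

Because every neuron is perturbed independently with the same probability, the noise is spread uniformly over the ring and does not favor local or random neighbors.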

The comparison between the upper ($m^0 = 1.0$) and lower ($m^0 = 0.1$) parts of Fig. 4 shows that the non-monotonic behavior of the information with dilution and randomness is stronger for the retrieval ($m^0 = 0.1$) than for the stability properties ($m^0 = 1.0$). One can understand this in terms of the basins of attraction. Random topologies have very deep attractors, especially if the network is diluted enough, while regular topologies almost lose their retrieval abilities with dilution. However, since the basins become rougher with dilution, the network takes longer to reach the attractor. Hence, the competition between depth and roughness is won by the more robust MD networks.

Each maximal $i_{max}(\gamma, \omega)$ in Fig. 4 is plotted in Fig. 5. We see that, for intermediate values of the randomness parameter, $0 < \omega < 0.3$, there is an optimal information with respect to the dilution $\gamma$ if the dynamics is truncated. We observe that the optimal $i_{opt} \equiv i_{max}(\gamma_{opt}, \omega)$ is shifted to the left (stronger dilution) when the randomness $\omega$ of the network increases.

For instance, with $\omega = 0.1$ the optimum is at $\gamma \simeq 0.020$, while with $\omega = 0.2$ it is at $\gamma \simeq 0.005$. This result does not change qualitatively with the flag time, but if the dynamics is truncated early, the optimal $\gamma_{opt}$, for a fixed $\omega$, is shifted to more connected networks. However, the behavior depends strongly on the initial condition: compared with $m^0 = 0.1$, where the maxima are pronounced, with $m^0 = 1.0$ the dependence on the topology becomes almost flat. We also see that for $\omega > 0.3$ there is no intermediate optimal topology. It is worth noting that the simulation converges to the theoretical results if $m^0 = 1.0$ when $t \to \infty$.

4.3 Simulation with Images 

The simulations presented so far use artificial, randomly generated patterns. In order to check whether our results are robust against the correlations possibly present in realistic patterns, we test the algorithm with images. We see that the same non-monotonic behavior of $i_{max}(\gamma)$ is observed here.

We have checked the results by using data derived from the Waterloo image database. We work with square-shaped patches. In order to use a Hebb-like binary network with non-sparse code and still preserve the structure of the image, we process the images by applying an edge filter, which preserves the edges. Each pixel of the patch represents a different neuron. The number of connections is up to $N \times K = 3 \cdot 10^5$, and the feasible connectivities (more than 3 patterns) are $\gamma > 0.002$.

Note that the procedure, strictly speaking, does not guarantee the conditions for the distribution of $\xi$, because neither is $p(\xi = \pm 1)$ uniform (due to the threshold in large blocks), nor are the $\xi_i$ uncorrelated (due to image edges).

We choose at random the origin of the patch and the image to be used from the 12 available images. The topology of the network is a ring with small-world topology. The results of the simulation, using the Chen filter, are shown in Fig. 3. The optimal connectivity with $\omega = 0.1$ and $t_f = 10$ is found to be $\gamma_{opt} \simeq 0.03$. The fluctuations are now much larger than with random patterns, due to correlations and the small network size. In the stationary states, $t_f \to \infty$, the optimal connectivity remains at $\gamma_{opt} \simeq 0.03$, with $i_{opt} \simeq 0.165$. The results agree qualitatively with the simulation for random patterns, Fig. 4, where the initial overlaps are $m^0 = 0.1$ and $m^0 = 1.0$ (in Fig. 3 it is always $m^0 = 0.3$).

5 Conclusions 

In this paper we have studied the dependence of the information capacity on the topology for an attractor neural network. We calculated the mutual information for a Hebb model storing binary patterns, varying the connectivity ($\gamma$) and randomness ($\omega$) parameters, and obtained the maximum with respect to $\alpha$, $i_{max}(\gamma, \omega) \equiv i(\alpha_{max}; \gamma, \omega)$. Then we looked for the optimal topology, $\gamma_{opt}$, in the sense of the information, $i_{opt} \equiv i_{max}(\gamma_{opt}, \omega)$. We presented stationary and transient states. The main result is that larger $\omega$ always leads to higher information $i_{max}$.

From the stability calculations, the stationary optimal topology is the extremely diluted (RED) network. Dynamics shows, however, that this is not true: we found there is an intermediate optimal $\gamma_{opt}$ for any fixed $0 < \omega < 0.3$. This can be understood by regarding the shape of the attractors. The ED network waits much longer for the retrieval than more connected networks do, so the neurons can be trapped in spurious states with vanishing information. The intermediate optimal $\gamma_{opt}$ is found whenever the retrieval is truncated, and it remains up to the stationary states.

Both in nature and in technological approaches to neural devices, dynamics is an essential issue for information processing. So, an optimized topology holds for any practical purpose, even if no attention is paid to wiring or other energetic costs of random links [17]. The reason is a competition between the broadness (larger storage capacity) and roughness (slower retrieval speed) of the attraction basins.

We believe that the maximization of information with respect to the topology could be a biological criterion (where non-equilibrium phenomena are relevant) to build real neural networks. We expect that the same dependence should hold for more structured networks and learning rules.

Acknowledgments: Work supported by grants TIC01-572, TIN2004-07676-C01-01, BFI2003-07276, TIN2004-04363-C03-03 from MCyT, Spain.